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Abstract 

In many areas of science one aims to estimate latent sub-population mean curves based 
only on observations of aggregated population curves. By aggregated curves we mean linear 
combination of functional data that cannot be observed individually. We assume that several 
aggregated curves with linear independent coefficients are available. More specifically, we assume 
each aggregated curve is an independent partial realization of a Gaussian process with mean 
modeled through a weighted linear combination of the disaggregated curves. We model the 
mean of the Gaussian processes as a smooth function approximated by a function belonging to 
a finite dimensional space Tlx which is spanned by K B-splines basis functions. We explore 
two different specifications of the covariance function of the Gaussian process: one that assumes 
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a constant variance across the domain of the process, and a more general variance structure 
which is itself modelled as a smooth function, providing a nonstationary covariance function. 
Inference procedure is performed following the Bayesian paradigm allowing experts' opinion to 
be considered when estimating the disaggregated curves. Moreover, it naturally provides the 
uncertainty associated with the parameters estimates and fitted values. Our model is suitable 
for a wide range of applications. We concentrate on two different real examples: calibration 
problem for NIR spectroscopy data and an analysis of distribution of energy among different 
type of consumers. 

Keywords: Bayes' theorem; B-splines; Covariance function; Gaussian process. 
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1 Introduction 



The problem we address is the estimation of latent sub-population mean and covariance curves 
when only populational aggregated data is available. By aggregated data we mean that each sample 
consists of linear combinations of functional data that cannot be observed individually for each 
sub-population. 

Certainly, there are many methods of fitting curves to data. A collection of techniques known as 
nonparametric regression, for example, allows great flexibility in the possible form of the regression 
curve a. In particular, it assumes no parametric form for it. In fact, a nonparametric regression model 
only makes the assumption that the regression curve belongs to some infinite collection of curves. 
Consequently, in order to propose a nonparametric model one may just need to choose an appropriate 
space of functions where he/she believes that the regression curve lies. This choice, usually, is 
motivated by the degree of smoothness of a. Then, one uses the data to determine an element of 
this function space that can represent the unknown regression curve. Consequently, nonparametric 
techniques rely more heavily on the data for information about a than their parametric counterparts. 
Also, this flexibility on the form of the curve allows one to incorporate prior information. The 
literature on nonparametric regression is vast, for the interested reader we refer to the book of 
Eubank (1999). 

The set up for nonparametric regression assumes that an unknown function a of one or more 
variables and a set of measurements j/i, . . . , are such that: 



where £i, . . . , £„ are linear functionals defined on some linear space 'H, containing a, and ei, . . . , e„ 
are measurement errors usually assumed to be independent, with common, zero mean normal dis- 
tributions with unknown variance . Typically, the Li will be point evaluations of the function a. 
That is, Lig = g{xi) and yi = y{xi), where Xj are the explanatory variables for i = 1, . . . , n. 

The problem we address here is more general. We have several unknown functions Oc, c = 1, . . . , C 
of one or more variables and the set of measurements are given by 



yi = Cia + Si 



c 





c=l 
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where £jc; i = 1, c = 1, . . . , C are the hnear functionals. The problem is to estimate the 
functions ac, c = 1, . . . ,C based on the measurements yi,i = 1, . . . , I . 

More specifically, in our model we assume each aggregated curve t/i is an independent partial 
realization of a Gaussian process with mean modeled through a weighted linear combination of the 
disaggregated curves acS. Following Dias, Garcia and Martarelli (2009) we model the mean of the 
Gaussian process as a smooth function approximated by a function belonging to a finite dimensional 
space Hk which is spanned by K B-splines basis functions. This is not the only choice, other basis 
could be used such as Fourier expansion, wavelets, natural splines. See, for example, Silverman 
(1986), Kooperberg and Stone (1991), Vidakovic (n.d.), Dias (1998) and Dias (2000). 

In this work, differently from Dias et al. (2009), we consider two different structures for the 
covariance function of the Gaussian process postulating models that impose positive definiteness 
condition on the function. The first one assumes a uniform structure across the domain of the process, 
the second one models the covariance itself as a smooth function providing a nonstationary behaviour. 
Our model is going to be applied for two practical situations. In both of them it is reasonable to 
consider that the experts in the field have prior information on the disaggregated curves. Therefore, 
in order to incorporate this opinion, inference procedure will be performed following the Bayesian 
paradigm. As a by-product, we naturally obtain the uncertainty associated with the parameters 
estimates, and fitted values. 

This paper is organized as follows. Section [2] presents two motivating examples: calibration 
problem for NIR spectroscopy data and an analysis of distribution of energy among different types 
of consumers. Section [3] describes our proposed hierarchical model to estimate latent disaggregate 
curves when only aggregated population observations are available. Therein we also propose a non- 
stationary covariance function allowing the variance of the underlying process to smoothly change 
across the domain of the function. Next section analyzes different sets of artificial data. The aim is to 
check the ability of the model in recovering the true disaggregated functions under different scenarios. 
Then Section |5] discusses the analysis for the two motivating examples described in subsections 12.11 
and 12.21 Finally, Section |6] concludes. 
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2 Motivating examples 



2.1 Near-infrared (NIR) spectroscopy data: a calibration problem 

Analyzing materials to determine their chemical composition is a basic tool of science. It is used in 
many applications such as food safety testing, protein detection, pharmaceutical purposes, forensics, 
to mention just a few. This analysis can be done directly in the laboratory by usually expensive and 
time-consuming techniques. An alternative is to determine the chemical composition through NIR 
spectroscopy which is a low cost technique, relatively simple to use and provides adequate accuracy 
in many practical situations. Some references with practical applications of NIR spectroscopy are 
Candolfi, De Maesschalck, Massart, Hailey and Harrington (1999); Rodriguez- Saona, Khambaty, Fry 
and Calvey (2001); Maraboli, Cattaneo and Giangiacomo (2002); Tewari, Mehrotra and Irudayaraj 
(2003); Cozzolino, Flood, Bellon, Gishen and De Barros Lopes (2006); Schonbrodt, S., Winter and 
G. (2006); Saranwong and Kawano (2008a, 2008b); Woodcock, O'Donnell and Downey (2008); 
Botonjic-Sehic, Brown, Lamontagne and Tsaparikos (2009); and Romia and Bernardez (2010). For 
introductory material on the subject see Shenk and Westerhaus (1991), Brereton (2003), and Burns 
and Ciurczac (2007). 

When atoms or molecules absorb light, the energy input excites a quantized structure to a higher 
energy level. The type of excitation depends on the wavelength of the light. NIR spectroscopy 
technology is based on the fact that each of the major chemical components of a sample has near 
infrared red absorption properties in the region 700-2500 nm. An absorption spectrum is the ab- 
sorption of light as a function of wavelength. The NIR spectrum of a sample is the summation of 
these absorption properties for each chemical sample resulting in a continuous curve measured by 
modern scanning instruments at hundreds of equally spaced wavelengths. The information contained 
in this curve can be used to predict the chemical composition of the sample. The problem lies in 
extracting the relevant information from possibly thousands of overlapping peaks. This can be ac- 
complished by applying the Beer-Lambert Law. The Beer-Lambert law is the linear relationship 
between absorbance and concentration of absorbing species. 

Analysing a training set of different samples with distinct compositions allow us to calibrate the 
analysis. Osborne, Fearn, and Hindle (1993) described applications in food analysis and reviewed 
some of the standard approaches to the calibration problem. Multivariate calibration techniques 
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are widely used in the literature for this kind of problem. We propose a different approach by 
treating this problem in the framework of functional data analysis. We regard the response of 
interest as an aggregated continuous curve observed only at a set of discrete points. Therefore, 
having measurements of the NIR spectrum for several chemical samples, with distinct compositions, 
will allow us to estimate the typical curve for each constituint of the sample. Figure 12.11 shows 
the absorbance curves measured for a dataset of 10 polyaromatic hydrocarbons (PAH) obtained 
by Electronic Absorption Spectroscopy. The sample consists of 25 chemical samples, each sample 
composed of varying compositions of 10 different constituents (pyrene, acenaphthene, anthracene, 
acenaphthylene, chrysene, benzanthracene, uoranthene, uorene, naphthalene, phenanthracene) . Each 
sample was submitted to 27 wavelengths (220nm-350nm). This dataset was presented by Brereton 
(2003) to illustrate multivariate calibration techniques. 



PAH Dataset 




I I \ I \ I \ 
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Figure 2.1: Polyaromatic hydrocarbons spectra 

2.2 Electric load 

The distribution of electric energy is done in several stages: first substations provide energy for 
regions in the city. This energy arrives at power transformers {trafo, an usual acronym for trans- 
former) that redistributes it to micro-regions. Each micro-region is composed of different types of 
consumers, residential, commercial, industrial, among others. For each type of consumer, there are 
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peaks of consumption at certain hours of the day. In Brazil, for example, it is empirically known that 
residential consumers have a peak on energy consumption between 6-8pm (due partially to the use 
of electric showers) and commercial and industrial consumers have their peak between 8am-6pm. To 
avoid overload, trafos have to be designed to deal with the maximum load of the day. Ideally, the 
distribution of electric energy should be done in such a way that there is a constant load during the 
whole day, all days of the week, all over the year for all power plants, substations and transformers. 
Therefore, to have a more efficient and uniform distribution of electricity, it is necessary to know the 
profile of the consumption for each type of consumer. For each type of consumer, this typical curve 
is called the typology. The empirical evidence described before might be used as prior information 
when modeling the typology for each type of consumer. 

Prom a practical point of view, it is very difficult and expensive to obtain samples from individual 
consumers. Commonly, the data available are aggregated data from power transformers (trafos). 
Typically each trafo comprises around 50 consumers. Notice that this data is the sum of all load 
demanded by the market (the number of consumers of each type) of this trafo. Moreover, due to 
billing issues, the market of each trafo is known. Therefore, having measurements of the electric 
load for several different trafos, with distinct markets, provides us with the information to estimate 
the individual curves for each type of consumer. Figure [2]2] shows the data from two trafos, that we 
analyse in Section [2] 

TR07 TR09 
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Figure 2.2: Observed curves for two trafos. 



Both problems described above can be viewed as examples of samples obtained from aggregated 
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data; that is, the observed data might be described as a hnear combination of individual functional 
processes and the aim is to estimate their individual mean and covariance functions. 

3 Proposed Model 

Let Yij{t) be the j-th replication of curve i observed at point t, t E [0,Tp]. We decompose lij(-) 
as the sum of two components. The first one is described as a weighted sum of C smooth curves, 
each representing the mean curve of a category c (c = 1, ■ ■ ■ ,C). For example, in the electric load 
example, each disaggregated curve represents the typical curve of consumer type c. The second 
component represents measurement error, described by a zero mean Gaussian process with some 
covariance function. More specifically, we assume 

c c 

''^iji't) = ^ricac{t) + ^eijc{t), z = 1,...,/, J = 1,..., J, (3.1) 

c=l c=l 

where ai(t), . . . , ac(t) are the mean curves related to category c = 1, 2, ■ ■ ■ ,C, respectively. The 
TjcS are assumed known and are related to the problem being investigated. These will be discussed 
in detail in Section [51 We assume £ijc{-) follows independent zero mean Gaussian processes with 
covariance function given by Zic{t, s), for t,s E [0,Tp]. 

In general, the required degree of smoothness depends on the problem under study. However, 
it is common to require that the functions ttc belong to the Sobolev space 'H2 = {/ • [^^a,^;?,] — t- 
^! X]j=o )^ ^ '^}- '^'^ consider the class as the set of possible mean curves is natural 
and desirable for this particular situation since it can be well approximated by a finite-dimensional 
approximating space generated by cubic B-splines, see de Boor (1978). Therefore, the second level 
of hierarchy expands the mean curves, ad-), as a linear combination of B-spline basis. We assume 
there exists a positive integer K and a knot sequence ^ = {^1, ■ ■ ■ , ^k) such that 

K 

«c(t) = Y.f^-'^^kit), (3.2) 

k=l 

where Bk{t), k = 1, . . . , K are cubic B-splines. Consider the function is evaluated at T points, with 
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ti < t2 < ■ ■ ■ < ^T, following equation fl3.2p . we can write 



BKih) 
BKitr) ) 



/ft. ^ 



(3.3) 



for c= 1,2,-- - ,C. 



Notice that the design matrix in equation (13. 3 p does not depend on the category c since we are 
using the same number of basis and same knot allocation for all categories. Moreover, in this model 
the coefficients do not depend on the sampled points and all points of all aggregated curves can be 
used to estimate the same C x K coefficients. Therefore, following equation (13. ip and the discussion 
above, we have the following linear model 

c K 

^vit) = 5^ 5^r*c/3cfe5fc(t) + Sijit), (3.4) 



c=l k=l 



where £ij{t) = X]^i^jjc(^)' because of the independence assumption of ejjc(') for i = 1, 



, J, and c 



C. We now discuss in detail the covariance structure among the ejj(-)s. 



3.1 Covariance structure of the measurement error 

The measurement error captures any structure left after adjusting the data to the sum of the latent 
disaggregated curves ad-)- As we are estimating functions we assume the errors are correlated across 
the domain of Yij{-). 

We expect the correlation between points Yij{t) and Yij{s) to decay exponentially, as |t — s| 
increases. For each category c, we assign an exponential correlation function with decay parame- 
ter 0c > for the Gaussian process associated to each eijc{-)- Our main contribution lies on the 
specification of the variance structure. For the i-th curve, let Zj(t, s) = Cov {e i j (t), Sij{s)), be the 
covariance between points t and s. Notice that we assume the same covariance structure across 

replicates j = 1, ■ ■ ■ , J. We propose the following general structure for Zi(t, s), 

c 

Zi{t, s) = ^ Cicr]c{t) r]c{s) exp(-</)c|t - s\), (3.5) 

c=l 

where Cic, c = 1, ■ ■ ■ ,C, i = 1, ■ ■ ■ , I are known constants. Like the constants in equation (13.41) . 
the CicS assume values related to the problem being studied. For every i, we allow the variances to 
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change with t, such that Zi{t,t) = X]c=i ^tcVcit)'^- More generally, the covariance function is allowed 
to change along the domain of the function. Because of the product ric{t) ric{s) in equation fl3.5p 
we do not need to impose any particular restriction on the ?7c(-)s to guarantee that we have a valid 
covariance function. It is worth noting that this covariance function might assume negative values, 
depending solely on the function ?7c(-)- 

We consider three different models for the components 'ric{-)s: 

(a) Uniformly homogeneous case: In this case we assume 

r]c(t) = a, 'it and (pc = 4>, 
implying that Zi{t, s) = {Ylc=i ^ic)^'^ exp(— 0|t — s|), for all c = 1, . . . , C. 

(b) Homogeneous case: Here we relax the assumption of common o"^ and </> by assuming 

which leads to Zi{t, s) = Ylc=i Cic'^l exp(— 0c|^ ~ i.e. 

(c) Heterogenous case: The more general case expands ?7c(-) in B-splines basis functions, such 
that 

L 

Vc{t) = ^e^Bt{t), (3.6) 

1=1 

where Be{t), i = 1, ■ ■ ■ , L are cubic B-splines with (possibly) a different knot sequence from that 
for ac{-), say and the covariance function follows the general structure shown in Equation 
dSS]). 

3.2 Likelihood function and Prior specification 

Assume y represents the JJT- dimensional vector of observations, with components y = (yii(ti), ■ ■ ■ , 
yuiti), ■ ■ ■ , yuitr), ■ ■ ■ , UuitT), ■ ■ ■ , UuitT)). Considering the more general case, denote by © the 
parameter vector for the model. We will specify this vector for the three covariance structures 
considered. Notice that for each i = I,-- - ,/, and j = 1, ■ ■ ■ , J, conditioned on the parameter 
vector, Yij = (Yij(ti), ■ ■ ■ jYijitx))' follows independent normal distributions such that 

Y,, ~iVT(X,/3,Z,), (3.7) 
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where Xi are T x CK matrices given by 



riiBxiti) 



ricBxiti) 



\ 



\ 



ruBiitr) ... ruBKitx) ... ricBiitr) ... ricBxitT) J 



(3 = (/3i, ■ ■ ■ ,(3cy, with f3c = (/3ci,-"" ^Pck)', is the CK dimensional vector of coefficients, and 
Zi = Zi{@) are covariance matrices of order T with elements given by Equation (13. 5p . Therefore, 
based on the observed vector y, the likehhood function for can be written as 



i=i j=i ' 

As our inference procedure follows the Bayesian paradigm, we now specify the prior distribution of 
the parameter vector depending on the covariance structure. 

Prior specification We assume prior independence among the components of the parameter vector 
0. In particular, for the coefficients /3c we assign /T- dimensional multivariate normal distributions 
with known mean vector be and covariance matrice f2c, c = 1, 2, ■ ■ ■ , C In Section [5] we assume a 
zero mean prior for /3c. However, experts in the field of interest might provide useful information 
about the shape of each function, and this can be induced through the mean of the prior distributions 
of the respective /3cS. For the covariance matrices fic we assume diagonal matrices, with the diagonal 
elements fixed at some large value to let the observed data drive the inference procedure. The prior 
specification of the parameters in the covariance function is related to the choice of the covariance 
structure proposed in Section 13.11 

(a) Uniformly homogeneous case: in this case the parameter vector is defined as 0^^ = cr^, 0). 
For 0"^ we assume an inverse gamma distribution with shape parameter d and rate parameter 
/. For 0, we assign a gamma prior distribution with shape parameter p and rate parameter q 
fixed at some reasonable value. For example, we can use the idea of practical range. The mean 
of the prior, p/g can be fixed such that at a reasonable distance, the correlation is close to 
zero, say 0.05. More specifically, we fix the mean at the value that solves 0.05 = exp(— 
where 0* is the prior mean guess we need, and dist is a fixed distance. 
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(b) Homogeneous case: for the homogenous covariance structure we define the parameter vector 

as ©"^ = af, (T2, ■ ■ ■ , a^, 01, 02, ■ " " , 0c)- We suggest independent inverse gamma prior 
distributions, each with parameters {dc,lc), for each o"^. We also assume prior independence 
among the decay parameters 0c of the exponential correlation function; and each one is assumed 
to follow a gamma prior distribution with parameters Pc and Qc, c = 1, ■ ■ ■ ,C, with and Qc 
fixed at some reasonable values. The same idea of practical range discussed in the uniformly 
homogenous case can be used here. 

(c) Heterogeneous case: the parameter vector to be estimated is Q^^ = (/3, 6i, - ■ ■ , Oc, 0i, ■ ■ ■ , 0c)- 

We assume independent i^-dimensional multivariate normal prior distributions for the coeffi- 
cients 6c, each with known mean vector dc, and covariance matrix Ac. Like in the mean values 
of /3, the dcS can be obtained by experts in the field. However, it might be more challenging to 
elicitate these values as they are in the covariance structure of the process. For the covariance 
matrices Ac we assume diagonal matrices, with the diagonal elements fixed at some reasonably 
large value to let the observed data drive the inference procedure. 

Posterior distribution and inference procedure Following the Bayesian paradigm, the poste- 
rior distribution, p{@ | y), is proportional to the likelihood function times the prior distribution of 

e. 

The resultant posterior distributions under all different covariance functions do not have closed 
forms. We use Markov chain Monte Carlo (MCMC) methods, specifically, the Gibbs sampler with 
some steps of the Metropolis-Hastings (M-H) algorithm to obtain samples from the target posterior 
distribution (see e.g. Gamerman and Lopes (2006)). In particular, the full conditional posterior dis- 
tributions of (3c are normal distributions, which are easy to sample from. Independent of the assumed 
covariance function, the full conditional posterior distributions of each of the parameters involved in 
it do not result in known distributions. For these parameters we make use of the Metropolis-Hastings 
algorithm with log-normal proposals based on the current value of the chain, and some fixed vari- 
ance, tuned to give reasonable acceptance rates. The MCMC algorithm was implemented in R (R 
Development Core Team, 2010), and the codes are available from the authors upon request. 
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3.3 Predictive inference 

From a Bayesian point of view, one can obtain the posterior predictive distribution of the function 
Yi{-) at unobserved values of the domain of the function, Y* = (Fi(t^), ■ ■ ■ ,Yi{tl)), for t^* G [0, Tp], / = 
1, ■ ■ ■ ,L, through 



p(yny)= f p{y:\y,@)p{@\y)d@. 



(3.9) 



The model assumes that samples Yi{-) are being generated from the multivariate normal distri- 
bution, N{Xif3, Zi). From the theory on the multivariate normal distribution (Anderson, 1984), it 
follows that the joint distribution of Y and Y*, conditioned on 0, is given by 





e ~iV - ; ' '^M , (3.10) 



where X* is a L-dimensional vector with elements equal to the cubic B-splines at point t;*; Xj is a 
vector comprising the cubic B-splines at the observed points tf, Z* is a covariance matrix of dimension 
L and each of its element is the covariance of the process between unobserved points. Each line of the 
matrix Zi-^,^, T x L, represents the covariance between the i^^ observed point and the j^^ unobserved 
one, i = 1, - ■ ■ ,T and j = 1, ■ ■ ■ , L. From the theory of the multivariate normal distribution we have 
that 

Y;|y„ e ~ iVz. {X*f3 + Zl^^Zr' (y. - X,f3) ; Z* - Z[^^Zr^Z.,,,) . (3.11) 

The integration in (13.91) does not have an analytical solution, however approximations can be easily 
obtained through Monte Carlo methods (Gamerman and Lopes, 2006). For each sample s, s = 
1, ■ ■ ■ , Q, obtained from the MCMC algorithm, we can obtain an approximation for (13. 9p . by sampling 
from the distribution in (13.111) and computing 



p(y*|y)^^Ep(y*l®^)- (3.12) 

Once samples from the posterior distribution of are available, realizations from the posterior 
predictive distribution can be obtained by sampling from the distribution of Y* | y, 0'^, with 0'^ 
representing the sth sampled parameter vector 0. 
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4 Analyzing artificial data sets 



Here we analyze six artificial sets of data to check the ability of the model in estimating the disag- 
gregated curves of interest when the truth is known. All datasets assume C = 2 population curves. 
We consider data are generated from 

Yijit) = raaiit) + r.aasl^) + i = 1,2,3, j = 1, ■ ■ ■ , J, (4.1) 

where the true curves are given by 

ai(t) = 5exp{— t} sin(7rt/2) cos(7rt) 

a2{t) = 5exp{-(t-0.2)}cos(7rt/2)sin(7rt), 

with rii = 1, ri2 = 4, r2i = 4, r22 = 1, ''"31 = 2.5 and = 2.5. These curves were chosen because 
they have interesting features to be captured by the model. We explore 6 different scenarios by 
assuming different specifications for the covariance structures of eij(-): 

Case 1: Uniformly homogeneous case In this case we assume all Cic = 1, cr^ = 1 and = 0.5. 

We concetrated on the case where there are no replicates for the aggregated curves, such that J = 1 
and we obtained samples for J = 10 and / = 30. 

Case 2: Homogeneous case Here we assume 

Zi{t, s) = CiiCTi exp(-0i|t - s|) + Ci2(jl exp{-(f)2\t - s\), (4.2) 

and we fix the parameters at the following values: = cr| = 1, 0i = 02 = 4 and Cn = 1, Cu = 1.3, 
C21 = 1.4, C22 = 1-3, C31 = 1.5 and C32 = 1.5. Here we consider only the case J = 15. 

Case 3: Heterogeneous case Here we assume 

Zi(t,s) = Cii?7i(t)?7i(s)exp(-0i|t - s|) + Ci2T]2(t)T]2{s) exp{-(f)2\t - s\) (4.3) 

with T]i and 772 curves generated as linear combinations of B-splines, 0i = 02 = 4 and Cn = 1, 
C12 = 1.3, C21 = 1.4, C22 = 1.3, C31 = 1.5 and C32 = 1.5. For the heterogeneous covariance 
structure we fit the model considering J = 15, J = 50 and J = 150. This is to investigate the 
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effect of the number of replicates on the estimates of the parameters when a more flexible covariance 
structure is assumed. 

For all datasets we assumed 14 B-splines basis with K = 10 internal knots. In the heterogeneous 
case, we assumed this same set of knots to estimate ?7i(.) and ri2{.). We let the MCMC algorithm 
run for 100,000 iterations, considered the first 5,000 as burn-in and kept every 95-th sample to avoid 
autocorrelation between the sampled values. Convergence of the chains was checked through the use 
of two chains starting from very different values. 

For the uniformly homogeneous case we notice that inspite of the value of / the posterior distri- 
bution of cr^ and seem to recover the true values used to generated the data. On the other hand, 
the value of / seems to have influence on the magnitude of the ranges of the 95% posterior credible 
intervals of the disaggregated curves ac(-) (Figures l4. II and . 

For the homogeneous case, even with only J = 15 replicates, the model is able to recover the 
true structure of the disaggregated functions ai{.) and a2(-)- The posterior mean of both curves are 
very close to their respective true values and the range of the 95% posterior credible intervals are 
relatively narrow. The parameters in the covariance structure are also well estimated (Figure l43l) . 

For the heterogenous case it is clear that the number of replicates affect the range of the posterior 
credible intervals both for ad-) and ric{.), c= 1,2. Regardless of the value of J the true values are 
recovered from the inference procedure, specially for the disaggregated functions ac(.) . The greater 
the number of replicates the narrower the 95% posterior credible intervals (Figures 14.41 and l4.5p . The 
true values of the decay parameters 0i and 02 are also recovered from the inference procedure, and 
similar to the results for ad-) and ric{.), the magnitude of J influences the range of the posterior 
credible intervals (Figure US])- 
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1=10 



1=30 





Figure 4.1: Summary of the posterior distribution of the parameters for the uniformly homogeneous 
case with J = 1, / = 10 (first column) and / = 30 (second column). Posterior mean curves 
(solid lines) and limits (shaded area) of the 95% posterior credible intervals. Dashed lines represent 
respective true values. 
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Figure 4.2: Summary of the posterior distribution of the parameters for the uniformly homogeneous 
case with J = 1 and / = 10 (first row) and / = 30 (second row). Posterior distribution of and 
and (j). Dashed hues represent respective true values. 
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Figure 4.3: Summary of the posterior distribution of the parameters for the homogeneous case with 
J = 15. Posterior mean curves (sohd hues) and hmits (shaded area) of the 95% posterior credible 
intervals (1st row), posterior distribution of af and cr| (2nd row), and (pi and 02 (3rd row). Dashed 
lines represent respective true values. 
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Figure 4.4: Posterior mean curves (solid lines) and limits (shaded area) of the 95% posterior credible 
intervals for (a) ai(.), J = 15, (b) a2{.), J = 15, (c) ai(.), J = 50, (d) a2{-), J = 50, (e) ai(.), 
J = 150, (f) a2{-), J = 150. In all panels the dashed line is the respective true curve. 
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Figure 4.5: Posterior mean (solid line) of the variance functions ?7i(.) and f]2{-) (rows) and limits 
(shaded area) of the 95% posterior credible intervals for the heterogeneous case with J = 15, J = 50 
and J = 150 (columns). The dashed lines represent the true curves used to generate the data. 
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Figure 4.6: Posterior summary (solid circle: posterior mean and lines represent the limits of the 
95% posterior credible intervals) of the decay parameters (pi and 02 in equation fl3.5p . Dashed line 
represents true value used to generate the data. 

5 Applications 

5.1 NIR Spectroscopy data - Polyaromatic hydrocarbons 

The Beer-Lambert law for K constituents plus noise states that for the ith chemical sample the 
measurement at wavelength t is given by 

c c 

Y,{t) = Y,nMt) + Y.'''^i^^'^^ tG [220,350], i = l,...,25. (5.1) 

1=1 e=i 

where r^^j is the concentration of the i constituent in the ith chemical sample , ae{t) is the absorbance 
at wavelength t of the ith pure constituent and e^^j is a random noise. 

Notice that the Beer-Lambert formula leads exactly to the functional model of aggregated data 
given by Equation (11. ip . 



J=15 J=50 J=150 
No. of replicates (J) 
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In this example, we used 14 B-spline basis with 10 internal knots equally spaced in the interval 
(220,350) to estimate the latent absorbance curves of each constituint, ad-)- 

Since there are no replicates of the population curves available and we have C = 10 constituints, 
we chose to fit the model considering a homogeneous covariance structure. In this case, we used a 
gamma prior for (p^, and an inverse gamma with parameters 2 and 0.2 for a^, c = 1, . . . , 10. We let 
the MCMC algorithm run for 100,000 iterations, considered the first 5,000 as burn-in and kept every 
100-th sample to avoid autocorrelation between the sampled values. 
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Figure 5.1: Estimated absorbance curves for the constituints, for the PAH dataset under the uni- 
formly homogeneous model. 
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Figure 15.11 shows the estimated individual curves for each of the constituints. However, to see 
how good these estimates are we should compare the aggregated observed curves with the weighted 
sum of these estimates using the Beer-Lambert formula. These comparisons are made in Figures 
15.21 and 15.31 The confidence bands for Figure 15.21 are the quantiles weighted sum of the estimates 
whereas Figure 15.31 compares the data obtained in some chemical samples with the predictive 95% 
confidence intervals. Both figures show that we have an excelent fit for the data. 
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Figure 5.2: Estimated curves for chemical samples 6, 12 ,21 and 24 
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Figure 5.3: Predictive 95% confidence intervals for chemical samples 6, 12 ,21 and 24 
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5.2 Electric load data 

Here we analyze electric load data as described in Section 12.21 In Brazil, for security reasons, houses 
are loaded with energy tension either equal to 127V or 220V. For this reason, they are classified as 
monophasic (single phase/ 127V) or biphasic (two phases/220V). Usually, more modest residencies 
are monophasic suggesting that monophasic and biphasic consumers have different typologies. We 
consider samples observed from 1 = 2 trafos which are denoted by TR07 and TR09. The market for 
each trafo is small and variable, consisting only of single phase and two phase residential consumers. 
The consumption of energy during weekends is different than from weekdays (Figure [2^ . Therefore, 
we decide to analyse only the weekdays resulting that J = 5 replicates. 

Measurements from trafos TR07 and TR09 were stored at every 15 minutes, during 5 days of a 
particular week. It is known that the electric load of each trafo i is equal to the sum of Ni = Nii + Ni2 
curves, where is the number of consumers of type c (monophasic or biphasic here), and {Nn, Ni2) 
is the market of trafo i. Table [5?T] presents the number of consumers (market) for each trafo. 



Trafo 


Single Phase 


Two Phase 


Total 


TR07 


87 


5 


92 


TR09 


25 


25 


50 



Table 5.1: Distribution of the number of consumers (market) for trafos TR07 and TR09 in the electric 
load application. 

Following the general structure in equation (13. ip . we model the traffic load of trafo at day j, 
observed at time t as 

C Ni^ 

^^jW = 5Z ^^^-c^W' ^e[0,24], 2 = 1,2, j = l,...,5, (5.2) 

c=l nc=l 

where Wcj,nc,iit) = ctcit) + ecj,nc,iit)- -^^^ ^^i^ example the constants r^c and Cic (in equations (13. 4p 
and (13. 5p ) coincide and are equal to Nic. 

We fitted the model assuming 14 B-spline basis, with K = 10 internal knots located at the fol- 
lowing points C, = {4, 6, 8, 10, 12, 14, 16, 18, 19, 20}. We fitted the model considering a heterogeneous 
covariance structure. The prior specification of 0c uses the idea of practical range as discussed in 
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Section I3.2[ A priori, we assume the correlation between measurements in a day dies off, in average, 
after 3/4 of an liour. We assume a gamma prior for (pc with mean equals 3/0.75 = 4, and variance 
1. Panels of figure |5^ show the posterior mean with respective 95% posterior credible intervals for 
the mean and variance curves for monophasic and biphasic consumers. As expected the single phase 
houses have a smaller load than the two phase residencies. Both types have a peak around 8pm, as 
this coincides with arriving home from work, taking showers, etc. The basic difference is that for two 
phase residencies there is an increase in the electric load from Sam to 12pm which is not observed 
for the single phase consumers. 

Monophasic Biphasic 




1 — I — I — I — t 1 — \ — \ — \ — I 

5 10 15 20 5 10 15 20 

t t 



Figure 5.4: Posterior mean (solid line) and 95% credible intervals (shaded area) for monophasic 
(single phase) and biphasic (two phase) consumers based on trafos TR07 and TR09 for the tipologies 
(first row) and variance (second row). The dotted lines in the first row show the estimate obtained 
under the frequentist approach proposed by Dias et al (2009). 
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Figure 15.51 compares the posterior predictive distribution with the observations. It is clear that 
the proposed model is capturing quite well the structure of the data, as most of the observations are 
within the limits of the 95% posterior predictive interval. 

TR07 TR09 




5 10 15 20 5 10 15 

Time (h) Time (h) 



Figure 5.5: Summary of the posterior predictive distribution (solid lines) and limits of the 0.025 and 
0.975 quantile curves (shaded area) compared with the respective observed values (hollow circles), 
for trafos TROT and TR09. 



6 Concluding remarks 

In this work our attention was focused on the Bayesian estimation of latent sub-population (disag- 
gregated) mean and covariance curves when we only have available observations of the population 
(aggregated) curves. Although our proposed models are an extension of the model initially proposed 
by Dias et al. (2009), we propose more flexible nonstationary covariance structures for the Gaussian 
process by allowing the variance of the process to change across the domain of the function. The 
general non-parametric case naturally imposes the positive definiteness of the covariance function 
and can be restricted to accomodate many different situations. We believe our proposed covariance 
structure might also be applied in different areas, e.g. geostatistics. 
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There are several advantages of using the Bayesian paradigm: 

• It might naturally incorporate prior information that is available to experts in the field; 

• Estimates are obtained under a single framework; 

• and it naturally provides the uncertainties of the estimates of the latent sub-population curves, 
not only of the mean curves but also of the covariance curves. 

To show the strength of our method, we analyzed two examples from different areas in science: 
environmetrics and chemometrics. In both examples it is clear, from comparing the observed aggre- 
gated curves with the weighted sum of the estimated latent ones, that the proposed model provide 
extremely reasonable estimates. 
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